[CORE-5696] Have a deduplicating job worker #18

Open

wants to merge 23 commits into master
Conversation

@skykanin (Contributor) commented Apr 11, 2023:

Implement deduplicating job worker option in consumers

@skykanin skykanin marked this pull request as ready for review April 21, 2023 12:28
@jsynacek (Contributor) left a comment:

Note that this is a public library. I'd say we shouldn't use Jira references here. It's better to link from the internal sources to the public ones.

@skykanin skykanin requested a review from jsynacek April 26, 2023 08:55
@jsynacek (Contributor) left a comment:

Looks good to me, apart from the .gitignore changes.

You might still want to wait for the final review from @arybczak, because I don't really know the code that well.

@arybczak (Collaborator) commented Apr 27, 2023:

I'd like to not merge this until @matobet adjusts his PR that uses this one, to (double) verify that it works as expected.

]
where
operator = case ccMode of
Standard -> "="
Duplicating _field -> "<="
Collaborator:

It would be better to change the type signature of updateJobs so that it takes only a single (idx, Result) in the deduplicating case 🤔 The problem now is that if something goes awry and multiple ids are passed here, the condition id <= ANY (...) will wreak havoc.

Collaborator:

Also, how were tests passing with this bug? :) They should be updated accordingly.

Ensure that only one job is processed at a time in deduplicating
mode so that we don't accidentally update multiple jobs using the
<= operator when checking for matching ids to update rows in the
database table.

-- | Result of processing a job.
data Result = Ok Action | Failed Action
deriving (Eq, Ord, Show)

-- | The mode the consumer will run in.
data Mode = Standard | Duplicating (RawSQL ())
@skykanin (Contributor, author):

Should we allow deduplicating on the primary key column of the jobs table, like ccMode = Duplicating "id"? If you do this right now, you get an ambiguity error in the SQL query used for reserving jobs in the consumer, because part of the query used in reserveJobs becomes SELECT id, id ...

Contributor:

My approach would be not to do that if it's not necessary for proper function now. It can be added later if there's a need for it. And maybe document it somewhere that you can't deduplicate based on fields that are called id.

@@ -269,7 +273,7 @@ spawnDispatcher ConsumerConfig{..} cs cid semaphore

     return (batchSize > 0)

-    reserveJobs :: Int -> m ([job], Int)
+    reserveJobs :: Int -> m (Either job [job], Int)
Contributor:

The Either is quite confusing. Could you add a comment about the idea behind it so that people don't have to figure out what it's supposed to mean? I wonder if a simple data type wouldn't be better here.

Contributor:

And by "better" I mean more readable...
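A hedged sketch of what such a dedicated type could look like (the names ReservedJobs, SingleJob, and JobBatch are invented here for illustration; they are not taken from the PR):

```haskell
-- Illustrative only: constructor names are hypothetical, not from the PR.
data ReservedJobs job
  = SingleJob job   -- deduplicating mode: exactly one job reserved at a time
  | JobBatch [job]  -- standard mode: a whole batch reserved
  deriving (Eq, Show)
```

reserveJobs could then return m (ReservedJobs job, Int), and call sites would pattern match on the constructor instead of decoding the meaning of Left versus Right.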



-- | Result of processing a job.
data Result = Ok Action | Failed Action
deriving (Eq, Ord, Show)

-- | The mode the consumer will run in.
data Mode = Standard | Duplicating (RawSQL ())


Duplicating should probably take a non-empty list of expressions, as one may want to de-duplicate on more than one expression. And the SQL expression type should probably be SQL rather than RawSQL ().

Collaborator:

RawSQL () is fine, it's for "sql literals", i.e. values that can't hold parameters.

, " ORDER BY run_at," <+> raw field <> ", id DESC LIMIT 1 FOR UPDATE SKIP LOCKED),"
, " lock_all AS"
, " (SELECT id," <+> raw field <+> "FROM" <+> raw ccJobsTable
, " WHERE" <+> raw field <+> "= (SELECT" <+> raw field <+> "FROM latest_for_id)"


Doesn't this almost entirely ignore the run_at column? There's no run_at <= <?> now, so this would process any job, even ones scheduled in the future. But even if that condition were set, the de-duplicating job worker would not be very efficient at de-duplicating if jobs were scheduled into the future or when ccNotificationChannel is set.

Collaborator:

Good catch. The mode should lock the group of jobs with the same deduplication id (dedId) that are scheduled to be processed with run_at <= now. If there are other jobs with this dedId scheduled for the future, they should be left alone.

The other problem here is that the reserved_by column isn't consulted. However, introducing a reserved_by check like in the standard case doesn't fully solve the issue, because even while a job is still being processed, another row with the same dedId might be inserted after it started. That row will then be started in parallel with the old one and there's going to be a race :/
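A rough sketch of the query shape this comment asks for, expressed as plain SQL (table and column names such as jobs and ded_id are placeholders standing in for ccJobsTable and the deduplication field; this is a sketch of the reviewer's suggestion, not the query in the PR):

```sql
-- Sketch only: lock the group of due jobs sharing the candidate's ded_id,
-- leaving future-scheduled jobs with the same ded_id alone.
WITH latest AS (
  SELECT id, ded_id FROM jobs
  WHERE reserved_by IS NULL AND run_at <= now()
  ORDER BY run_at, ded_id, id DESC
  LIMIT 1
  FOR UPDATE SKIP LOCKED
)
SELECT j.id
FROM jobs j, latest l
WHERE j.ded_id = l.ded_id
  AND j.reserved_by IS NULL
  AND j.run_at <= now()   -- future jobs with this ded_id are left alone
FOR UPDATE OF j SKIP LOCKED;
```

As the comment notes, even with these conditions a row inserted with the same ded_id after processing starts can still race with the running job.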

@marco44 (Contributor) commented Jul 5, 2023:

Ok, sorry, I feel like I'm arriving after the party…
From what I see:

  • Nothing prevents two sessions from doing this at almost the same time, both thinking they want to work on the same value of "raw field", which I suppose is the dedId mentioned elsewhere?
  • If we want to be 100% sure we don't have two jobs working on the same dedId at the same time, there is a more direct approach: just lock that dedId. Either in memory using an advisory lock (something like SELECT pg_advisory_lock(hash(dedid))), or create a table for this with dedId as the unique column and primary key. When you want to work on a dedId, you insert a record there; when you have finished, you delete it and commit. No one else will be able to work on it in the meantime. Advisory locks are probably better here: you can tie them to a transaction or not (better if you want them freed on error, for instance), and you have the "try" function variants. So maybe it would be simpler to:
  • Find a candidate, get its deduplication id (the SELECT ... FOR UPDATE SKIP LOCKED should help us parallelize it), and try locking that dedId in shared memory. Then the rest probably becomes simpler and more error-proof?

BTW, maybe I misunderstood how this works; I didn't look at the Haskell code all around.
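A minimal sketch of the advisory-lock variant described above, using the Postgres built-ins pg_try_advisory_xact_lock and hashtext (the 'some-ded-id' value is a placeholder; in practice the dedId would be interpolated by the consumer):

```sql
BEGIN;
-- Try to take a transaction-scoped lock on this ded_id. It is released
-- automatically on COMMIT or ROLLBACK, including on error.
SELECT pg_try_advisory_xact_lock(hashtext('some-ded-id'));
-- true  -> we own this ded_id until the transaction ends; process the job
-- false -> another worker holds it; skip and pick another candidate
COMMIT;
```

The transaction-scoped variant matches the "freed on error" property marco44 mentions; the non-try pg_advisory_xact_lock would block instead of skipping.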


-- on which @'ccMode'@ the consumer is running in.
limitJobs = case ccMode of
Standard -> Right
Duplicating _field -> Left . head
Collaborator:

This doesn't seem right.

You're picking the first job from the list and assuming later in updateJob that it was the one with the highest id, but why? The query above doesn't sort on the id field. And even if you take the highest one, it's not guaranteed that you want to update all jobs with a lower id later (once the run_at handling is fixed in the reserveJobs query).

Uhh, this looks to be more complicated than I first thought it would be (even more so considering my other comment below).
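One way to at least remove the ordering assumption this comment points out would be to pick the maximum id explicitly rather than relying on list order (a sketch only: jobId is an assumed accessor, not from the PR, and this does not address the run_at problem discussed in the earlier comments):

```haskell
import Data.List (maximumBy)
import Data.Ord (comparing)

-- Sketch only: 'jobId' stands in for whatever accessor extracts the
-- job's id. Unlike 'head', this does not depend on the query's sort
-- order; it still doesn't fix the run_at issue raised above.
newestJob :: Ord k => (job -> k) -> [job] -> Maybe job
newestJob _     [] = Nothing
newestJob jobId js = Just (maximumBy (comparing jobId) js)
```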

5 participants